feat(cli): add Kaggle dataset integration and Croissant metadata parsing #11
Conversation
Add new `--kaggle` and `--croissant` CLI flags for streamlined dataset workflows:
- `toon username/dataset --kaggle` downloads and converts Kaggle datasets to TOON
- `toon metadata.json --croissant` parses ML Commons Croissant metadata
- `--file` flag to select specific files from multi-file datasets
- Auto-detection of Kaggle slugs (username/dataset-name format)

New module `toon/kaggle.py` provides:
- `download_dataset()`: Download Kaggle datasets via the kaggle CLI
- `find_best_csv()`: Heuristic selection of the main data file
- `csv_to_records()`: CSV to list[dict] conversion
- `parse_croissant()`: Extract schema from Croissant JSON-LD
- `croissant_to_summary()`: Generate human-readable dataset summaries

All functions are optional imports - the package gracefully degrades if the `kaggle` package is not installed. Includes a comprehensive test suite (12 tests, 100% pass).
Pull request overview
Adds Kaggle dataset download support and Croissant (ML Commons) JSON-LD parsing to the TOON tooling so users can go from dataset metadata/slug to TOON output via the CLI (and via a small Python API surface).
Changes:
- Introduces `toon/kaggle.py` with Kaggle CLI download utilities, CSV selection/conversion, and Croissant metadata parsing/summary helpers.
- Extends the `toon` CLI with `--kaggle`, `--croissant`, and `--file`/`-f` flows to download/parse and then encode to TOON.
- Exposes Kaggle/Croissant helpers from `toon/__init__.py` and adds unit tests for the new module.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| `toon/kaggle.py` | New Kaggle/Croissant helper module (download via kaggle CLI, CSV heuristics, Croissant schema extraction & summary). |
| `toon/cli.py` | Adds new CLI flags and execution paths for Kaggle downloads and Croissant parsing. |
| `toon/__init__.py` | Exports Kaggle/Croissant helpers as part of the public API (attempts "optional" gating). |
| `tests/test_kaggle.py` | Adds tests for slug detection, CSV conversion, Croissant parsing/summary, and CSV selection heuristic. |
```python
try:
    from .kaggle import (
        is_kaggle_slug,
        download_dataset,
        find_best_csv,
        csv_to_records,
        parse_croissant,
        croissant_to_summary,
    )
    KAGGLE_AVAILABLE = True
except ImportError:
    KAGGLE_AVAILABLE = False
```
KAGGLE_AVAILABLE is set based on importing toon.kaggle, but that module has no external imports, so this will be True even when the Kaggle CLI isn’t installed/configured. This makes the if not KAGGLE_AVAILABLE: branches effectively dead code and the error message about needing the “kaggle package” misleading. Consider removing this import-gating entirely and instead detect the kaggle executable (e.g., via shutil.which('kaggle')) or rely on download_dataset() raising a clear error, and update the messaging to refer to the Kaggle CLI/credentials rather than the Python package.
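A minimal sketch of that alternative, assuming the check runs during CLI startup (the exact wiring is not part of this PR):

```python
# Sketch: gate on the kaggle executable instead of importing toon.kaggle,
# whose import always succeeds since it only uses the stdlib.
import shutil
import sys

KAGGLE_AVAILABLE = shutil.which('kaggle') is not None

if not KAGGLE_AVAILABLE:
    # Name the real runtime dependency: the CLI binary plus credentials.
    print('Error: Kaggle support requires the kaggle CLI on PATH '
          '(pip install kaggle, then configure credentials)', file=sys.stderr)
```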
```python
# Handle Kaggle dataset download
if args.kaggle or (KAGGLE_AVAILABLE and args.input and is_kaggle_slug(args.input)):
```
This condition enables implicit Kaggle-slug auto-detection (treating args.input as Kaggle when it matches username/dataset), even if the user didn’t pass --kaggle. This behavior isn’t described in the PR description/CLI help and can change semantics for relative paths like data/user/file that don’t exist yet. Either require --kaggle explicitly or document the auto-detection behavior clearly (and consider making it opt-in).
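One way to make the auto-detection safer, sketched with an assumed pattern and a local-path guard (the PR's actual `is_kaggle_slug` may differ):

```python
# Sketch: only treat input as a Kaggle slug when it matches
# username/dataset-name AND does not resolve to an existing local path,
# so relative paths like data/user/file are not silently hijacked.
import re
from pathlib import Path

_SLUG_RE = re.compile(r'^[A-Za-z0-9][\w.-]*/[A-Za-z0-9][\w.-]*$')

def looks_like_kaggle_slug(value: str) -> bool:
    return bool(_SLUG_RE.match(value)) and not Path(value).exists()
```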
```python
# Handle Kaggle dataset download
if args.kaggle or (KAGGLE_AVAILABLE and args.input and is_kaggle_slug(args.input)):
    if not KAGGLE_AVAILABLE:
        print('Error: Kaggle support requires the kaggle package. '
              'Install with: pip install kaggle', file=sys.stderr)
        return 1

    try:
        print(f'Downloading Kaggle dataset: {args.input}', file=sys.stderr)
        files = download_dataset(args.input)

        # Find the target file
        if args.select_file:
            target = next(
                (f for f in files if args.select_file in f.name),
                None
            )
            if not target:
                print(f'Error: No file matching "{args.select_file}" in dataset',
                      file=sys.stderr)
                print(f'Available files: {[f.name for f in files]}', file=sys.stderr)
                return 1
        else:
            target = find_best_csv(files)
            if not target:
                # Try JSON files
                json_files = [f for f in files if f.suffix.lower() == '.json']
                target = json_files[0] if json_files else None

        if not target:
            print('Error: No CSV or JSON files found in dataset', file=sys.stderr)
            return 1

        print(f'Using: {target.name}', file=sys.stderr)

        # Read and convert
        content = target.read_text(encoding='utf-8', errors='replace')

        if target.suffix.lower() == '.csv':
            data = csv_to_records(content)
        else:
            data = json.loads(content)

        # Encode to TOON
        options = {
            'delimiter': args.delimiter,
            'indent': args.indent,
            'key_folding': args.key_folding,
        }
        if args.flatten_depth is not None:
            options['flatten_depth'] = args.flatten_depth

        output_content = encode(data, options)
        input_content = json.dumps(data)  # For stats comparison

        # Show statistics if requested
        if args.stats:
            input_tokens = count_tokens(input_content)
            output_tokens = count_tokens(output_content)

            print(f'Input (JSON): {len(input_content)} bytes', file=sys.stderr)
            print(f'Output (TOON): {len(output_content)} bytes', file=sys.stderr)
            if len(input_content) > 0:
                print(f'Size reduction: {(1 - len(output_content) / len(input_content)) * 100:.1f}%',
                      file=sys.stderr)

            if input_tokens is not None and output_tokens is not None:
                print(f'Input tokens: {input_tokens}', file=sys.stderr)
                print(f'Output tokens: {output_tokens}', file=sys.stderr)
                print(f'Token reduction: {(1 - output_tokens / input_tokens) * 100:.1f}%',
                      file=sys.stderr)
            else:
                print('(Install tiktoken for token statistics)', file=sys.stderr)

            print('---', file=sys.stderr)

        write_output(output_content, args.output)
        return 0

    except Exception as e:
        print(f'Error: {e}', file=sys.stderr)
        return 1
```
New Kaggle/Croissant CLI flows added here aren’t covered by tests. Since tests/test_cli.py already exists, it would be good to add tests for --kaggle/--croissant behavior (mocking subprocess.run / download_dataset, and verifying file selection + output) to prevent regressions.
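A minimal sketch of such a test, assuming the CLI exposes a `main(argv)` entry point in `toon.cli` (the entry-point name and argv signature are assumptions, not confirmed by this PR):

```python
# Sketch: exercise the --kaggle path without network access by patching
# download_dataset in the CLI module's namespace.
from unittest.mock import patch

import toon.cli as cli

def test_kaggle_flag_converts_downloaded_csv(tmp_path):
    csv_file = tmp_path / 'data.csv'
    csv_file.write_text('a,b\n1,2\n')

    # Stub out the kaggle CLI dependency entirely.
    with patch.object(cli, 'download_dataset', return_value=[csv_file]):
        exit_code = cli.main(['user/dataset', '--kaggle'])

    assert exit_code == 0
```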
```python
import csv
import io
import json
```
import json is unused (only referenced in a docstring example). If the project enforces linting, this will fail; otherwise it’s still noise. Consider removing the import or using json in actual code.
Suggested change:

```diff
-import json
```
```python
if output_dir is None:
    output_dir = tempfile.mkdtemp(prefix="toon_kaggle_")
```
When output_dir is not provided, this creates a temp directory via mkdtemp() that is never cleaned up by the library or the CLI path (leaving behind toon_kaggle_* directories). Consider using tempfile.TemporaryDirectory() in the CLI flow (so it’s cleaned up), or documenting/returning the temp directory path and providing a cleanup option.
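A sketch of the cleanup-friendly variant in the CLI flow, relying on the `output_dir` parameter shown above:

```python
# Sketch: scope the download directory so it is removed automatically.
# All file contents must be read before the with-block exits.
import tempfile

with tempfile.TemporaryDirectory(prefix='toon_kaggle_') as tmp_dir:
    files = download_dataset(args.input, output_dir=tmp_dir)
    target = find_best_csv(files)
    content = target.read_text(encoding='utf-8', errors='replace')
# tmp_dir and everything under it is gone here; only `content` survives.
```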
```python
target = next(
    (f for f in files if args.select_file in f.name),
    None
)
if not target:
    print(f'Error: No file matching "{args.select_file}" in dataset',
          file=sys.stderr)
    print(f'Available files: {[f.name for f in files]}', file=sys.stderr)
    return 1
```
File selection uses substring matching (args.select_file in f.name) and returns the first match from an arbitrary rglob() ordering. This can select the wrong file when multiple names contain the substring (e.g., data.csv vs mydata.csv). Prefer exact matches first (e.g., f.name == args.select_file), and if multiple matches exist, surface an error listing candidates.
Suggested change:

```diff
-target = next(
-    (f for f in files if args.select_file in f.name),
-    None
-)
-if not target:
-    print(f'Error: No file matching "{args.select_file}" in dataset',
-          file=sys.stderr)
-    print(f'Available files: {[f.name for f in files]}', file=sys.stderr)
-    return 1
+# Prefer exact filename matches first
+exact_matches = [f for f in files if f.name == args.select_file]
+if len(exact_matches) == 1:
+    target = exact_matches[0]
+elif len(exact_matches) > 1:
+    print(
+        f'Error: Multiple files named "{args.select_file}" found in dataset:',
+        file=sys.stderr,
+    )
+    for f in exact_matches:
+        print(f'  - {f}', file=sys.stderr)
+    return 1
+else:
+    # Fall back to substring matches, but require them to be unambiguous
+    substring_matches = [f for f in files if args.select_file in f.name]
+    if len(substring_matches) == 1:
+        target = substring_matches[0]
+    elif len(substring_matches) > 1:
+        print(
+            f'Error: Multiple files matching "{args.select_file}" in dataset:',
+            file=sys.stderr,
+        )
+        for f in substring_matches:
+            print(f'  - {f}', file=sys.stderr)
+        return 1
+    else:
+        print(
+            f'Error: No file matching "{args.select_file}" in dataset',
+            file=sys.stderr,
+        )
+        print(
+            f'Available files: {[f.name for f in files]}',
+            file=sys.stderr,
+        )
+        return 1
```
```python
data_types = field.get("dataType", ["unknown"])
type_str = data_types[0] if data_types else "unknown"
# Clean up schema.org prefixes
type_str = type_str.replace("sc:", "").replace("https://schema.org/", "")
```
field.get('dataType') is treated as a list (data_types[0]), but in JSON-LD it can also be a single string. If dataType is a string, this will take the first character (e.g., 'sc:Integer' -> 's') and produce an incorrect schema. Normalize dataType to a list (or handle str explicitly) before selecting the first type.
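A sketch of that normalization, reusing `field` from the snippet above:

```python
# Sketch: accept both str and list forms of JSON-LD dataType
# before selecting the first entry.
raw = field.get("dataType", "unknown")
data_types = [raw] if isinstance(raw, str) else (raw or ["unknown"])
type_str = data_types[0]
# Clean up schema.org prefixes
type_str = type_str.replace("sc:", "").replace("https://schema.org/", "")
```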
```python
# Kaggle integration (optional - requires kaggle installation)
try:
    from .kaggle import (
        download_dataset,
        find_best_csv,
        csv_to_records,
        parse_croissant,
        croissant_to_summary,
        is_kaggle_slug,
    )
    _KAGGLE_AVAILABLE = True
except ImportError:
    _KAGGLE_AVAILABLE = False

    def download_dataset(*args, **kwargs):
        raise ImportError("download_dataset requires kaggle to be installed. Please install kaggle to use this feature.")

    def find_best_csv(*args, **kwargs):
        raise ImportError("find_best_csv requires kaggle to be installed. Please install kaggle to use this feature.")

    def csv_to_records(*args, **kwargs):
        raise ImportError("csv_to_records requires kaggle to be installed. Please install kaggle to use this feature.")

    def parse_croissant(*args, **kwargs):
        raise ImportError("parse_croissant requires kaggle to be installed. Please install kaggle to use this feature.")

    def croissant_to_summary(*args, **kwargs):
        raise ImportError("croissant_to_summary requires kaggle to be installed. Please install kaggle to use this feature.")

    def is_kaggle_slug(*args, **kwargs):
        raise ImportError("is_kaggle_slug requires kaggle to be installed. Please install kaggle to use this feature.")
```
The try/except ImportError gating here is misleading: toon.kaggle is part of this package and only uses stdlib imports, so this import will succeed regardless of whether the user has the Kaggle CLI installed. As a result, the fallback stubs will never be used, and the error messages about needing to “install kaggle” don’t reflect the actual runtime dependency (the kaggle executable + credentials). Consider removing the ImportError gating and documenting/checking for the Kaggle CLI instead, or explicitly gating on shutil.which('kaggle') if you want a true “availability” switch.
```python
if not KAGGLE_AVAILABLE:
    print('Error: Croissant support requires the kaggle module.', file=sys.stderr)
```
The --croissant path is gated on KAGGLE_AVAILABLE and the error message says it “requires the kaggle module”, but Croissant parsing here is pure-stdlib (json + parse_croissant) and shouldn’t require the Kaggle CLI. Consider decoupling Croissant support from any Kaggle availability checks and updating the message accordingly (or dropping the check entirely).
Suggested change:

```diff
-if not KAGGLE_AVAILABLE:
-    print('Error: Croissant support requires the kaggle module.', file=sys.stderr)
+# Croissant support depends on the availability of parse_croissant,
+# but does not require the Kaggle CLI itself.
+if 'parse_croissant' not in globals():
+    print('Error: Croissant support is not available in this installation.', file=sys.stderr)
```
```python
"""Tests for Kaggle integration module."""

import pytest
```
Import of 'pytest' is not used.
Suggested change:

```diff
-import pytest
```
🎉 This PR is included in version 1.6.0 🎉 The release is available on:
Your semantic-release bot 📦🚀 |
Description
Add Kaggle dataset integration and Croissant (ML Commons) metadata parsing to streamline dataset-to-TOON workflows. This enables users to download Kaggle datasets and convert them to TOON format in a single command.
Features
New CLI flags:
- `--kaggle` - Treat input as Kaggle dataset slug
- `--croissant` - Parse input as Croissant JSON-LD metadata
- `--file` / `-f` - Select specific file from multi-file datasets

Usage examples:
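A few illustrative invocations assembled from the flags above (dataset slugs and filenames are placeholders):

```bash
# Download a Kaggle dataset and convert its main CSV to TOON
toon username/dataset --kaggle

# Pick a specific file from a multi-file dataset (hypothetical filename)
toon username/dataset --kaggle --file train.csv

# Summarize Croissant JSON-LD metadata
toon metadata.json --croissant
```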
New Python API:
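A sketch of the exported helpers in use (the function names are from this PR; return types are inferred from the CLI code above):

```python
# Sketch: download a dataset, pick its main CSV, and turn it into records.
# download_dataset shells out to the kaggle CLI and returns file paths.
from toon import download_dataset, find_best_csv, csv_to_records

files = download_dataset('username/dataset')
target = find_best_csv(files)               # heuristic pick of the main CSV
records = csv_to_records(target.read_text(encoding='utf-8'))
print(records[:3])
```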
Implementation
New module `toon/kaggle.py` provides:
- `download_dataset()` - Download Kaggle datasets via the kaggle CLI
- `find_best_csv()` - Heuristic selection of the main data file
- `csv_to_records()` - CSV to list[dict] conversion
- `parse_croissant()` - Extract schema from Croissant JSON-LD
- `croissant_to_summary()` - Generate human-readable summaries
- `is_kaggle_slug()` - Detect Kaggle dataset slug format

All imports are optional - the module gracefully degrades if the `kaggle` package is not installed.

Type of Change
Testing
Checklist